Focusing on the issue that feature selection for the usually encountered large scale data sets in the "big data" is too slow to meet the practical requirements, a fast feature selection algorithm for unsupervised massive data sets was proposed based on the incremental absolute reduction algorithm in traditional rough set theory. Firstly, the large scale data set was regarded as a random object sequence and the candidate reduct was set empty. Secondly, random object was one by one drawn from the large scale data set without replacement; next, each random drawn object was checked if it could be distinguished with the other objects in the current object set and then merged with current object set, if the new object could not be distinguished using the candidate reduct, a new attribute that can distinguish the new object should be added into the candidate reduct. Finally, if successive I objects were distinguishable using the candidate reduct, the candidate reduct was used as the reduct of the large scale data set. Experiments on five unsupervised large-scale data sets demonstrated that a reduct which can distinguish no less than 95% object pairs could be found within 1% time needed by the discernibility matrix based algorithm and incremental absolute reduction algorithm. In the experiment of the text topic mining, the topic found by the reducted data set was consistent with that of the original data set. The experimental results show that the proposed algorithm can obtain effective reducts for large scale data set in practical time.
In the era of big data, research in topic evolution is mostly based on the classical probability topic model, the premise of word bag hypothesis leads to the lack of semantic in topic and the retrospective process in analyzing evolution. An online incremental feature ontology based topic evolution algorithm was proposed to tackle these problems. First of all, feature ontology was built based on word co-occurrence and general WordNet ontology base, with which the topic in text stream was modeled. Secondly, a text stream topic matrix construction algorithm was put forward to realize online incremental topic evolution analysis. Finally, a text topic ontology evolution diagram construction algorithm was put forward based on the text steam topic matrix, and topic similarity was computed using sub-graph similarity calculation, thus the evolution of topics in text stream was obtained with time scale. Experiments on scientific literature showed that the proposed algorithm reduced time complexity to O(nK+N), which outperformed classical probability topic evolution model, and performed no worse than sliding-window based Latent Dirichlet Allocation (LDA). With ontology introduced, as well as the semantic relations, the proposed algorithm can demonstrate the semantic feature of topics in graphics, based on which the topic evolution diagram is built incrementally, thus has more advantages in semantic explanatory and topic visualization.